The Misuse & Abuse of P-Values in Analytics
What is all the fuss about?
# imports
import numpy as np
import pandas as pd
import statsmodels.api as sm

# set random seed
np.random.seed(42)
# sample size
N = 200
# generate predictors + noise
x1, x2, x3, noise = [np.random.normal(size=N) for _ in range(4)]
# outcome with specified effects
y = 1 + 0.1 * x1 + 0.5 * x2 + 0.01 * x3 + noise
# create dataframe
df = pd.DataFrame(dict(y=y, x1=x1, x2=x2, x3=x3))
# fit ols model
X = sm.add_constant(df[['x1', 'x2', 'x3']])
lm_results = sm.OLS(df.y, X).fit()
# render regression table
create_regression_table(lm_results)

| Outcome = y | Estimate (95% CI) | Std. Error | t-statistic | p-value |
|---|---|---|---|---|
| Intercept | 1.03*** [0.89, 1.18] | 0.07 | 14.29 | 0.000 |
| x1 | 0.2* [0.05, 0.35] | 0.08 | 2.55 | 0.011 |
| x2 | 0.39*** [0.25, 0.53] | 0.07 | 5.34 | 0.000 |
| x3 | 0.13 [-0.02, 0.27] | 0.07 | 1.75 | 0.081 |
| * p<0.05; ** p<0.01; *** p<0.001 | | | | |
| R2=0.167; Adj. R2=0.155 | | | | |
Ronald A. Fisher
Using simulations to demonstrate issues with p-values
fig, ax = plt.subplots(figsize=(8, 6))
sns.histplot(
data=p_values.filter(pl.col("N") == 50).to_pandas(),
x="p", hue="effect", palette=colors,
alpha=0.7, bins=20, edgecolor=".3"
)
ax.axvline(x=0.05, color='#D93649', linestyle='--', linewidth=5)
ax.get_legend().set_title("Effect Size")
plt.ylabel("Simulations")
plt.xlabel("P-Value")
plt.xlim(0, 1)
plt.tight_layout()
plt.show()
report_metrics(p_values, n=50)

Simulation Results:
- Sample Size Per Group: 50
- Simulations Per Condition: 1000
- False Positive Rate (H₀): 0.05 (Expected: 0.05)
- Statistical Power (H₁): 0.18 (Target: 0.8)
- Mean p-value (H₀ vs H₁): 0.50 vs 0.35
fig, ax = plt.subplots(figsize=(8, 6))
sns.histplot(
data=p_values.filter(pl.col("N") == 100).to_pandas(),
x="p", hue="effect", palette=colors,
alpha=0.8, bins=20, edgecolor=".3"
)
ax.axvline(x=0.05, color='#D93649', linestyle='--', linewidth=5)
ax.get_legend().set_title("Effect Size")
plt.ylabel("Simulations")
plt.xlabel("P-Value")
plt.xlim(0, 1)
plt.tight_layout()
plt.show()
report_metrics(p_values, n=100)

Simulation Results:
- Sample Size Per Group: 100
- Simulations Per Condition: 1000
- False Positive Rate (H₀): 0.05 (Expected: 0.05)
- Statistical Power (H₁): 0.30 (Target: 0.8)
- Mean p-value (H₀ vs H₁): 0.50 vs 0.29
fig, ax = plt.subplots(figsize=(8, 6))
sns.histplot(
data=p_values.filter(pl.col("N") == 500).to_pandas(),
x="p", hue="effect", palette=colors,
alpha=0.7, bins=20, edgecolor="black"
)
ax.axvline(x=0.05, color='#D93649', linestyle='--', linewidth=5)
ax.get_legend().set_title("Effect Size")
plt.ylabel("Simulations")
plt.xlabel("P-Value")
plt.xlim(0, 1)
plt.tight_layout()
plt.show()
report_metrics(p_values, n=500)

Simulation Results:
- Sample Size Per Group: 500
- Simulations Per Condition: 1000
- False Positive Rate (H₀): 0.05 (Expected: 0.05)
- Statistical Power (H₁): 0.89 (Target: 0.8)
- Mean p-value (H₀ vs H₁): 0.51 vs 0.03
expected = alpha * n_tests
fig, ax = plt.subplots(figsize=(8, 6))
sns.histplot(
data=multiple_comparisons.to_pandas(), x="false_positives",
bins=range(0, 12), color="#005EB8", edgecolor="black"
)
plt.axvline(x=expected, color='#D93649', linestyle='--', linewidth=5)
plt.xlabel("False Positives (p < 0.05)")
plt.xticks(range(0, 11, 1))
plt.legend([f"Expected ({n_tests} × {alpha} = {expected})"])
plt.tight_layout()
plt.show()
report_multiple_comparisons(multiple_comparisons, n_tests, n_sims, alpha)

Simulation Results:
- Average number of false positives: 2.51
- Probability of at least one false positive: 92.8%
- Expected number of false positives: 2.5
Recommendations for using p-values
summary = (
    p_values
    .filter(pl.col("effect") == "0.2")
    .with_columns((pl.col("p") > 0.05).alias("non_sig"))
    .group_by("N")
    .agg([
        pl.col("non_sig").sum().alias("non_sig_count"),
        pl.len().alias("total")
    ])
    .with_columns(
        # proportion of simulations that fail to detect the true effect
        (pl.col("non_sig_count") / pl.col("total")).alias("non_sig_rate")
    )
    .select(["N", "non_sig_count", "non_sig_rate"])
    .sort("N")
)
format_power_table(summary)

| Sample Size | Non-Significant Results | Proportion |
|---|---|---|
| 10 | 925 | 0.93 |
| 20 | 908 | 0.91 |
| 50 | 817 | 0.82 |
| 100 | 701 | 0.70 |
| 250 | 392 | 0.39 |
| 500 | 113 | 0.11 |
| 1000 | 5 | 0.01 |
| Effect = 0.2, 1000 Simulations | | |
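The table suggests an obvious recommendation: fix the required sample size before collecting data, rather than hoping N is large enough. statsmodels can solve for the per-group n needed to hit a power target; a sketch using the same effect (0.2) and the 0.8 target quoted in the simulations:

```python
from statsmodels.stats.power import TTestIndPower

# solve for the per-group n needed to detect a standardised effect of 0.2
# with 80% power at alpha = 0.05 (two-sided test)
n_required = TTestIndPower().solve_power(
    effect_size=0.2, power=0.8, alpha=0.05, alternative="two-sided"
)
print(f"Required sample size per group: {n_required:.0f}")
```

The answer lands between the N = 250 and N = 500 rows of the table, consistent with the simulated power jumping from well below to above 0.8 between those two sample sizes.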
Summarising what we’ve learned
Contact:
Code & Slides:
Paul Johnson // Matters of Significance // May 22, 2025